Association analysis in microbiome studies infers relationship between OTUs/ taxa. Pearson r Correlation and Spearman’s Rank Correlation are classical methods in association analysis. However, microbiome relative data is compositional and consequently make the classical methods inappropriate. If naively applying classical methods to compositional microbiome data, we likely end up with getting spurious correlations.
The review article published in 2017, “Microbiome datasets are compositional: and this is not optional” (Gloor et al. 2017)summarized four newly-invented statistical methods dealing with compositionality and achieving reliable association relationship. The four methods consists of SparCC, Spiec Easi, proportionality (ϕ and ρ).
Proportionality is a precise indicator of absolute correlation, although sensitivity is limited (Thomas P. Quinn et al. 2017).
In this project, I intend to demonstrate analysis over the HMP-IBD microbiome dataset using the above four statistical methods; and to compare outcomes based on the four methods against ones based on Spearman’s Rank Correlation.
library(tidyverse)
library(devtools)## Warning: package 'devtools' was built under R version 4.2.2
## Warning: package 'usethis' was built under R version 4.2.2
install_github("tpq/propr")
library(propr)
library(stringr)The microbiome OTU table and metadata was retrieved from ML Repo
# raw OTU table
raw_otu <- read.csv(file = "https://knights-lab.github.io/MLRepo/datasets/turnbaugh/refseq/otutable.txt",
header=T,
sep = "")The raw OTU table contains 557 features (OTUs) in 281 samples.
Shorten OTU identifiers with keeping only genus and species names,
using stringr package. And then accumulate counts of OTUs
at the species level.
head(raw_otu$X.OTU) ## [1] "NR_147401.1_Sutterella_massiliensis_strain_Marseille-P2435_16S_ribosomal_RNA__partial_sequence"
## [2] "NR_025930.1_Ruminococcus_bromii_strain_ATCC_27255_16S_ribosomal_RNA_gene__partial_sequence"
## [3] "NR_025230.1_Megasphaera_micronuciformis_strain_AIP_412.00_16S_ribosomal_RNA_gene__partial_sequence"
## [4] "NR_024994.1_Lactobacillus_mucosae_strain_S32_16S_ribosomal_RNA_gene__complete_sequence"
## [5] "NR_024750.1_Coprococcus_catus_strain_VPI-C6-61_16S_ribosomal_RNA_gene__partial_sequence"
## [6] "NR_024684.1_Eubacterium_nitritogenes_strain_JCM_6485_16S_ribosomal_RNA_gene__partial_sequence"
# split strings
otu.id <- raw_otu$X.OTU
split_otu.id <- str_split_fixed(otu.id, "_", n=6) %>%
as.data.frame()
# join genus and species fragments
join_split_otu.id <- split_otu.id %>%
mutate(taxa = str_c(V3, V4, sep = " "))
taxa <- join_split_otu.id$taxa
# merge taxa to raw otu table
raw_otu_taxa <- cbind(raw_otu, taxa)
# accumulate counts at the species level
otu_count_splevel <- raw_otu_taxa %>%
gather(key = "sample", value = "count", -c("ID", "X.OTU", "taxa")) %>%
group_by(sample, taxa) %>%
summarise(count_sp = sum(count, na.rm = T)) %>%
arrange(desc(count_sp)) %>%
ungroup()Read in the metadata from URL.
# metadata
meta <- read.csv(file = "https://knights-lab.github.io/MLRepo/datasets/turnbaugh/task-obese-lean-all.txt",
header = TRUE,
sep = "") %>%
mutate(sample = X.SampleID, is.obese = Var) %>%
select(sample, is.obese)
head(meta)The metadata contains 142 samples. Var indicates the
independent, binary variable of interest, Lean and
Obese.
Transpose the OTU table into sample x taxa format which is suitable for proportionality analysis.
otu_count_splevel2 <- otu_count_splevel %>%
spread(key = "taxa", value = count_sp) %>%
column_to_rownames("sample")Now, OTU count table at the species level contains 517 features, in 281 samples.
Filter low abundant taxa before the proportionality analysis. Keep only the taxa with at least 10 counts in at least 5 samples.
keep_index <- apply(otu_count_splevel2, 2, function(x) sum(x >= 10) >= 5) The resulting OTU table after filtering the low abundant taxa contains 154 unique species/taxa.
Add group variables to the resulting OTU table.
otu_count_splevel2_meta <- otu_count_splevel2 %>%
rownames_to_column("sample") %>%
inner_join(meta, by= "sample") %>%
column_to_rownames("sample")
head(otu_count_splevel2_meta)Check if any missing values in the OTU table. Replace missing values with zero if any.
sum(is.na(otu_count_splevel2)) ## [1] 0
# otu_count_splevel2_rmna <- otu_count_splevel2 %>%
# replace(is.na(.), 0)
#
# sum(is.na(otu_count_splevel2_rmna))Calculate a proportionality measurement, rho using
propr function, following the procedures and codes provided
by T.P.Quinn et al. (Thomas P. Quinn et al.
2019)
Set three required arguments in the wrapper propr
function as follows,
counts a count matrix with subjects as rows and
features as columns
metric the proportionality metric to calculate,
‘rho’, ‘pho’ or ‘phs’
ivar reference OTU/ feature
pr <- propr(counts = otu_count_splevel2, # a count matrix with subjects as rows and features as columns
metric = "rho", # the proportionality metric to caculate, 'rho', 'pho' or 'phs'
ivar = "clr", # reference OTU/ feature for alr
symmetrize = FALSE,
p = 100, # permutation cycles (100, by default)
select = keep_index) # filter low-abundant taxa
## alternatively, calculate phi, rho or phs using phit(), perb() and phis(), respectively.
# phi <- phit(raw_otu_t_na, symmetrize = TRUE)
#
# rho <- perb(raw_otu_t_na, ivar = 0)
#
# phs <- phis(raw_otu_t_na, ivar = 0) For proportionality, permute false discovery rate (FDR) for a given cut-off, instead of estimating parametric P-value.
# select a good cutoff for 'rho' by permuting the FDR at various cutoffs.
# below, use [0, .05, ..., .95, 1]
# pr <- updateCutoffs(pr, cutoff = seq(0,1,.05))
#
# pr@fdrAccording to the package vignette,
choose the largest cutoff that keeps the FDR below 0.05. In this case,
rho cutoff is 0.20.
Visualize proportionality with a strict cutoff, using
getNetwork function. The package vignette
describes several built-in tools for visualizing proportionality.
getNetwork(pr, cutoff = 0.2) ## IGRAPH b2b88e6 UN-- 150 969 --
## + attr: name (v/c), color (v/c), size (v/n), label (v/l), color (e/c)
## + edges from b2b88e6 (vertex names):
## [1] Pseudoflavonifractor phocaeensis--Anaeromassilibacillus senegalensis
## [2] Pseudoflavonifractor phocaeensis--Parabacteroides distasonis
## [3] Pseudoflavonifractor phocaeensis--Flavonifractor plautii
## [4] Pseudoflavonifractor phocaeensis--Flintibacter butyricus
## [5] Pseudoflavonifractor phocaeensis--Drancourtella massiliensis
## [6] Pseudoflavonifractor phocaeensis--{Ruminococcus} torques
## [7] Pseudoflavonifractor phocaeensis--{Clostridium} xylanolyticum
## [8] Pseudoflavonifractor phocaeensis--Frisingicoccus caecimuris
## + ... omitted several edges
Export highly-proportional (rho >= 0.9) features (OTU).
result_rho_df <- getResults(pr, cutoff = 0.2)
result_rho_df How to interpret proportionality? Proportionality depends on log-transformation and must get interpreted with respect to the chosen reference.
Interpret clr-based proportionality to signify a coordination that follows the general trend of the data. In other words, these proportional OTUs move together as individuals relative to how most genes move on average (Thomas P. Quinn et al. 2019).
propd functions in the propr package can
test whether proportionality of certain pairs change between binary
experimental groups.
According to the package vignette,
propd method compares the variance of log-ratio (VLR) for
one pair across groups, considering two types of differential
proportionality,
disjointed proportionality: proportionality of a
pair holds in both groups, but the ratio between the partners changes
between the groups (i.e., the slope of the proportionality
changes)
emergent proportionality: proportionality exits in
only one of the groups (i.e., the strength of the proportionality
changes)
propd function estimates differential proportionality by
calculating \(theta\) for all features
pairs.
The function takes the following arguments as input:
counts a matrix of \(n\) samples (as rows) and \(d\) features (as columns)
group an \(n\)-dimensional vector corresponding to
subject labels
alpha an optional argument to trigger and guide
transformation
p the total number of permutations used to estimate
FDR
counts <- otu_count_splevel2_meta %>% select(-is.obese)
group <- otu_count_splevel2_meta$is.obese
pd <- propd(counts, group, alpha = NA, p=100) theta_d <- setDisjointed(pd) # disjointed diff prop
theta_e <- setDisjointed(pd) # emergent diff prop theta_d_df <- getResults(theta_d)
theta_d_df %>%
arrange(theta) theta_e_df <- getResults(theta_e)
theta_e_df %>%
arrange(theta_e)plot(theta_d@counts[, "Odoribacter splanchnicus"],
theta_d@counts[, "Bacteroides massiliensis"],
col = ifelse(theta_d@group == "Obese", "red", "blue"))
grp1 <- theta_d@group == "Obese"
grp2 <- theta_d@group != "Obese"
abline(a = 0,
b = theta_d@counts[grp1, "Odoribacter splanchnicus"] / theta_d@counts[grp1, "Bacteroides massiliensis"],
col = "red")
abline(a = 0,
b = theta_d@counts[grp2, "Odoribacter splanchnicus"] / theta_d@counts[grp2, "Bacteroides massiliensis"],
col = "blue")plot(theta_d@counts[, "Odoribacter splanchnicus"] / theta_d@counts[, "Bacteroides massiliensis"],
col = ifelse(theta_d@group == "Obese", "red", "blue"))
visualize network based on differential proportionality
g <- plot(theta_d,
cutoff=300,
d3= F) # TRUE for displaying 3-D plot plot(theta_e, cutoff=200, d3=T) ## IGRAPH b724f48 UN-- 128 200 --
## + attr: name (v/c), color (v/c), size (v/n), label (v/l), color (e/c)
## + edges from b724f48 (vertex names):
## [1] {Clostridium} lavalense--{Clostridium} oroticum
## [2] {Clostridium} lavalense--Eubacterium ruminantium
## [3] {Clostridium} lavalense--{Clostridium} celerecrescens
## [4] {Clostridium} oroticum --{Clostridium} spiroforme
## [5] {Clostridium} oroticum --Blautia faecis
## [6] {Clostridium} oroticum --Citrobacter freundii
## [7] {Clostridium} oroticum --Holdemania filiformis
## [8] {Clostridium} oroticum --Neglecta timonensis
## + ... omitted several edges